To process the raw sequence data we use dada2 at MiCA. DADA2 infers exact amplified sequence variants (ASV). It is possible to get single nucleotide resolution allowing for the highest possible phylogenetic resolution with the current technologies.
For quality assurance we include positive and negative control samples in each pooled amplicon libraries The positive control consists of a MOCK community sample for which the composition is known. In this case the sample was composed of DNA from 55 different bacterial strains. More details about the mock samples can be found at https://f1000research.com/articles/5-1791/v1 As a show case we will process one of these mock libraries
First we install and load the necessary packages.
# load libraries
.cran_packages <- c("ggplot2", "reshape2")
.bioc_packages <- c("dada2", "phyloseq","ShortRead","decontam")
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
install.packages(.cran_packages[!.inst])
}
.inst <- .bioc_packages %in% installed.packages()
if(any(!.inst)) {
source("http://bioconductor.org/biocLite.R")
biocLite(.bioc_packages[!.inst], ask = F)
}
# Load packages into session, and print package version
sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE, quietly = T)
## ggplot2 reshape2 dada2 phyloseq ShortRead decontam
## TRUE TRUE TRUE TRUE TRUE TRUE
To start the workshop we process the sequence data of one of the positive controls. First we download the and asses the quality of the run.
FQF <- "./Raw_data/L3_MOCK1.F.fastq.gz"
FQR <- "./Raw_data/L3_MOCK1.R.fastq.gz"
#visualize the quality of the data
plotQualityProfile(c(FQF,FQR))
So we can see the sample has 74333 reads. Furthermore, the quality of the reverse read is much poorer than that of the forward read.
A bit more intuitive is to look at the expected error rather than looking at the mean quality distribution.
x <- cbind(as.matrix(PhredQuality(quality(readFastq(FQF)))),as.matrix(PhredQuality(quality(readFastq(FQR)))))
#calculate cumalative error rates
x <- t(apply(10^(-x/10), 1, cumsum))
# Error rate statistics
plot(density(x[,500]), xlab="Expected Errors")
summary(x[,500])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2137 4.4166 7.2537 8.1537 11.0237 34.6272
This shows that on average a read has 8 expected errors. This results in many unique sequences.
drrut <- derepFastq(FQF)
drrut
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 59118 unique sequences
## Sequence lengths: min=251, median=251, max=251
## $quals: Quality matrix dimension: 59118 251
## Consensus quality scores: min=7, median=35, max=38
## $map: Map from reads to unique sequences: 53754 5696 9443 25284 17014 ...
sum(table(drrut$map)==1)
## [1] 56871
drfut <- derepFastq(FQR)
drfut
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 68629 unique sequences
## Sequence lengths: min=251, median=251, max=251
## $quals: Quality matrix dimension: 68629 251
## Consensus quality scores: min=7, median=31, max=38
## $map: Map from reads to unique sequences: 38277 22800 68505 24182 30459
## ...
sum(table(drfut$map)==1)
## [1] 67474
From the reverse read 67474 out of 74333 (90%) are unique! This is an issue for amplicon inference because dada2 can only infer ASVs from duplicated reads.
Luckily we can remove some of the poor quality regions of the reads to alleviate this issue somewhat. The length of the V3V4 amplicons are between 400 and 430 bases, which means there is a 100 to base pair overlap between the forward and reverse read. In order to merge the ends we need at least a 20 bp overlap, and thus can trim a total of 50 bases of the ends. We found that trimming 10 bases from the forward read and 40 from the reverse is optimal, and thus we trim the forward and reverse reads to 240 and 210 respectively.
Furthermore DADA2 cannot handle sequences containing ambiguous nucleotides and therefore these are filtered out (maxN=0). In contrast to the dada2 tutorial we do not filter on expected errors. We have found that sequence quality differs between amplicons and therefore quality filtering will skew the final composition.
# filtered output
FQFF <- "./Output/FQFF.gz"
FQRF <- "./Output/FQRF.gz"
out <- filterAndTrim(FQF, FQFF, FQR, FQRF ,maxN=0, truncQ=0, rm.phix=T, compress=TRUE, multithread=F, truncLen=c(240,210))
out
## reads.in reads.out
## L3_MOCK1.F.fastq.gz 74333 74329
There are no problems with filtering and only four sequences were removed.
These sequences were actually Illumina control spike in phiX sequences which are incorrectly assigned to the sample during demultiplexing of the libraries. This issue is known as cross talk. This is not specific for phiX but also between samples. Though in general the rate of cross talk is low, this specifically impacts presence/absence type of analysis.
x <- cbind(as.matrix(PhredQuality(quality(readFastq(FQFF)))),as.matrix(PhredQuality(quality(readFastq(FQRF)))))
#calculate cumalative error rates
x <- t(apply(10^(-x/10), 1, cumsum))
# Error rate statistics
plot(density(x[,450]), xlab="Expected Errors")
summary(x[,450])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.08024 2.35631 4.54074 5.48335 7.72546 28.87945
drf <- derepFastq(FQFF)
drf
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74329 reads in 55651 unique sequences
## Sequence lengths: min=240, median=240, max=240
## $quals: Quality matrix dimension: 55651 240
## Consensus quality scores: min=7, median=35, max=38
## $map: Map from reads to unique sequences: 50528 5875 9479 24068 16529 ...
sum(table(drf$map)==1)
## [1] 53124
drr <- derepFastq(FQRF)
drr
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74329 reads in 55205 unique sequences
## Sequence lengths: min=210, median=210, max=210
## $quals: Quality matrix dimension: 55205 210
## Consensus quality scores: min=7, median=33, max=38
## $map: Map from reads to unique sequences: 31496 173 55083 21297 25824 ...
sum(table(drr$map)==1)
## [1] 52819
This reduced the mean expected error rate to 5.5 but more importantly the number of singleton reads dropped to 53124 (71 %)
The second step in the process is the estimation of the batch specific error rates. Use of the quality score to test the validity of a sequence is what sets dada2 apart from the other platforms.
Error estimation is somewhat computational demanding and therefore precalculated error rates for this particular sequence run is supplies. Otherwise error rates can be obtained using learnErrors (eg errF <- learnErrors(FQFF, multithread=F)). The following plots can be used to asses the error models and should show linear decrease in substitution rates with quality score.
errF <- readRDS("./Raw_data/L3_dada2.errF.RDS")
plotErrors(errF, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis
errR <- readRDS("./Raw_data/L3_dada2.errR.RDS")
plotErrors(errR, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis
Points show the quality score dependent substitution rates while the red shows the calculated rates. Results show that overall substitution rates are even higher than what can be expected from the quality score but in general follow the trend.
The third step in the dada2 workflow is the inference of the ASVs.
asv.f <- dada(derep = drf, err = errF)
## Sample 1 - 74329 reads in 55651 unique sequences.
asv.f
## dada-class: object describing DADA2 denoising results
## 216 sequence variants were inferred from 55651 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
asv.r <- dada(derep = drr, err = errR)
## Sample 1 - 74329 reads in 55205 unique sequences.
asv.r
## dada-class: object describing DADA2 denoising results
## 133 sequence variants were inferred from 55205 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
sum(is.na(asv.f$map[drr$map]) | is.na(asv.r$map[drf$map]))
## [1] 8005
saveRDS(asv.f,"./Output/asvf.RDS")
saveRDS(asv.r,"./Output/asvr.RDS")
dada2 inferred 216 amplicon sequences from the forward reads and 133 from the reverse reads. Furthermore dada2 did not find an suitable ASV representative for 8005 (10%) of the reads.
To get a single representative sequence ASV corresponding to each read pair are merged. Only those pairs that can be joined with at least a 20 bases overlap and without any mismatches are accepted.
asv.m <- mergePairs(dadaF = asv.f, derepF = drf, dadaR = asv.r, derepR = drr, returnRejects = T)
## Duplicate sequences in merged output.
head(asv.m)
## sequence
## 1 TGAGGAATATTGGTCAATGGACGCAAGTCTGAACCAGCCATGCCGCGTGCAGGATGACGGCTCTATGAGTTGTAAACTGCTTTTGTACGAGGGTAAACGCAGATACGTGTATCTGTCTGAAAGTATCGTACGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGATAAGTTAGAGGTGAAATTTCGGGGCTCAACCCTGAACGTGCCTCTAATACTGTTGAGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAACTATATCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAACA
## 2 TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGGAGGAAGAAGGTCTTCGGATTGTAAACTCCTGTTGTTGAGGAAGATAATGACGGTACTCAACAAGGAAGTGACGGCTAACTACGTGCCAGCAGCCGCGGTAAAACGTAGGTCACAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAAGACAAGTTGGAAGTGAAATCTATGGGCTCAACCCATAAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCGGTGTAGCGGTGGAATGCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCCTACTGGGCACCAACTGACGCTGAGGCTCGAAAGTGTGGGTAGCAAACA
## 3 TCGAGAATCATTCACAATGGGGGAAACCCTGATGGTGCGACGCCGCGTGGGGGAATGAAGGTCTTCGGATTGTAAACCCCTGTCATGTGGGAGCAAATTAAAAAGATAGTACCACAAGAGGAAGAGACGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGCGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCGGTGTAGCAGTGAAATGCGTAGATATCGAGAGGAACACTCGTGGCGAAGGCGGGTTCCTGGACATTAACTGACGCTGAGGCACGAAGGCCAGGGGAGCGAAAG
## 4 TGAGGAATATTGGTCAATGGCCGAGAGGCTGAACCAGCCAAGTCGCGTGAGGGATGAAGGTTCTATGGATCGTAAACCTCTTTTATAAGGGAATAAAGTGCGGGACGTGTCCCGTTTTGTATGTACCTTATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTTTTAAGTCAGCGGTGAAAGTCTGTGGCTCAACCATAGAATTGCCGTTGAAACTGGGGGGCTTGAGTATGTTTGAGGCAGGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCCTGCCAAGCCATTACTGACGCTGATGCACGAAAGCGTGGGGATCAAACA
## 5 TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAAGGAAGAAGTATCTCGGTATGTAAACTTCTATCAGCAGGGAAGATAGTGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGTGTGGCAAGTCTGATGTGAAAGGCATGGGCTCAACCTGTGGACTGCATTGGAAACTGTCATACTTGAGTGCCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACA
## 6 TGAGGAATATTGGTCAATGGGCGCAGGCCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGTTTTCCACGTGTGGAATTTTGTATGTACCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACAGTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGCTGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGACTGCAACTGACACTGATGCTCGAAAGTGTGGGTATCAAACA
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 5532 3 1 28 0 0 2 TRUE
## 2 5147 1 2 48 0 0 1 TRUE
## 3 3919 2 4 42 0 0 1 TRUE
## 4 2372 5 5 28 0 0 2 TRUE
## 5 2018 4 6 48 0 0 1 TRUE
## 6 1869 6 3 28 0 0 2 TRUE
saveRDS(asv.m,"./Output/asvm.RDS")
This table shows the statistics for all possible ASV pairs. Forward and reverse read mismatching is a common problem of Illumina sequencing. Therefore significant amount of reads are lost in this process.
length(asv.m$accept)
## [1] 4586
sum(asv.m$accept)
## [1] 755
#sum(asv.m[asv.m$accept,]$abundance)
sum(asv.m[!asv.m$accept,]$abundance)
## [1] 11726
Out of 4586 possible combinations 755 valid merged ASVs are generated. A total of 11726 (15%) reads belong to read pairs which do not properly merge. In total 25 % of the input signal has been dropped in the process.
During PCR and bridge amplification potential chimeric sequences are generated. These hybrid sequences generate artificial links in the pylogenetic relation of the ASV and cloud analysis. Therefore the ASVs are screened for chimeras. V3V4 amplicons have a conserved region in the middle and are therefore more prone to chimera formation.
seqtab <- makeSequenceTable(asv.m)
## Duplicate sequences detected and merged.
bimeras <- isBimeraDenovo(seqtab)
sum(bimeras)
## [1] 655
sum(na.omit(asv.m$abundance[asv.m$bimera]))
## [1] 0
asv.m$bimera <- bimeras[asv.m$sequence]
asv.m$accept2 <- asv.m$accept & !asv.m$bimera
asv.valid.seqs <- asv.m$sequence[asv.m$accept2]
length(asv.valid.seqs)
## [1] 100
sum(seqtab[1,][asv.valid.seqs])
## [1] 44112
sum(asv.m[asv.m$sequence %in% asv.valid.seqs,]$abundance)
## [1] 44112
seqtab.nochim <- seqtab[,!bimeras]
A total of 655 ASVs were discarded as possible chimeric sequences, representing 11598 reads (15%)
In the end a total of 100 valid ASVs were inferred representing 44112 reads (57%). This means we lost 43% of the data in processing of the reads. These are quite common statistics.
To determine the accuracy and precision we compare the valid ASVs to to the sequences of the 55 reference strains. Unfortunately the reference sequences for each member have been obtained by Sanger sequencing. Due to this no isoform information is available and the reference sequences contain some ambiguous basecalls. Lets consider all distances smaller than 5 as possible true positive isoforms.
# Read in the sequences of the mock
mockref <- as.character(readDNAStringSet("./Raw_data/Mock_reference_NoN.fasta"))
# Apperently the 16S sequences of the members in the mock are not unique!
names(mockref)[mockref==names(which(table(mockref)!=1))]
## [1] "MC_11 Bacteroides MC_11" "MC_16 Bacteroides MC_16"
# Remove duplicated reference
mockref <- mockref[!duplicated(mockref)]
# Get the hamming distance for each ASV and Mock member
# First build a dataframe
hamming.distance.mat <- matrix(nrow = length(asv.valid.seqs), ncol = length(mockref), NA)
colnames(hamming.distance.mat) <- mockref
rownames(hamming.distance.mat) <- asv.valid.seqs
hamming.distance.mat.long <- setNames(melt(hamming.distance.mat), c('asv', 'mock', 'dist'))
# Calculate the hamming distance between all ASVs and the mock
hamming.distance.mat.long$dist <- nwhamming(as.character(hamming.distance.mat.long$asv),as.character(hamming.distance.mat.long$mock))
# Get the closest relative for each mock and asv
mockrefdist <- aggregate(dist ~ mock, data = hamming.distance.mat.long, min)
asvrefdist <- aggregate(dist ~ asv, data = hamming.distance.mat.long, min)
asvrefdist$name <- factor(paste0("ASV_",1:100), levels=paste0("ASV_",1:100))
mockrefdist$name <- factor(names(mockref)[match(mockref, mockrefdist$mock)], levels=names(mockref))
ggplot(mockrefdist, aes(x=name, y=dist)) +
geom_bar(stat="identity") +
coord_flip() +
labs(y="Minimum hamming distance", title="Distance mock to representive asv (Recall)", x=NULL)
ggplot(asvrefdist, aes(x=name, y=dist)) +
geom_bar(stat="identity") +
coord_flip() +
ggplot2::geom_hline(yintercept = 5, col='red') +
labs(y="Minimum hamming distance", title="Distance asv to mock representive (Precision)", x=NULL)
# False positive ASVs
sum(asvrefdist$dist>=6)
## [1] 30
# False positive ASVs total reads
sum(seqtab.nochim[asvrefdist$dist>=6])
## [1] 2083
This shows that all but one member has a valid representative ASV. There are 30 false positive ASVs, representing 2083 reads (3.7%). These are usually stealthy chimeras or contaminants
Overall we went from more than 68629 unique sequence to comprehensible set of only 100 representative sequences.
While most reagents used for library preparation are sterile, trace amounts of contaminant DNA can be found in almost all of them. If, for example, DNA yields are a confounding factor in the experimental design, these microbes can easily be identified as possible biomarkers. It is best practice to identify these and remove them from the data. For this we use the package decontam also develop by the the same persons developing dada2.
There are two approaches to detecting contaminants. The first is the inclusion of negative controls (NC) and testing the prevalence of specific ASVS on those samples compared to the biological samples. The problem here is you need multiple NC samples to get statistical significance and we rather sequence real samples.
Another more useful approach is to identify them based on frequency. Here a negative association with PCR yields is used to identify contaminants. The following code uses decontam to identify possible contaminants. The relations are visualized to validate the contaminant classifications.
For more information visit: https://benjjneb.github.io/decontam/vignettes/decontam_intro.html
# readin the phyloseq object
ps <- readRDS("Raw_data/ps.2018_0008_00.2019-01-11.mod.RDS")
# transform sample counts to relative abundance
pst <- transform_sample_counts(ps, function(x) x/sum(x))
#detect contaminats
dco <- isContaminant(ps, conc = "Nucl_Acid_AMP_Conc_RS")
#putative contaminats list
putative.contaminants <- rownames(dco)[dco$contaminant]
#total number of putative contaminants
length(putative.contaminants)
## [1] 93
#Taxonomy of possible contaminants
ps@tax_table[putative.contaminants,6]
## Taxonomy Table: [93 taxa by 1 taxonomic ranks]:
## Genus
## ASV_0102 "Coprococcus_3"
## ASV_0609 "Marvinbryantia"
## ASV_0663 "Blautia"
## ASV_0782 "Marvinbryantia"
## ASV_0908 "Lachnospira"
## ASV_1005 NA
## ASV_1116 "Lachnospiraceae_UCG-010"
## ASV_1180 "Christensenellaceae_R-7_group"
## ASV_1198 "Christensenellaceae_R-7_group"
## ASV_1238 "Tumebacillus"
## ASV_1627 "Veillonella"
## ASV_1698 "Negativicoccus"
## ASV_1708 NA
## ASV_1786 NA
## ASV_1824 "S5-A14a"
## ASV_1902 "Romboutsia"
## ASV_1936 "Lactobacillus"
## ASV_2002 "Lactobacillus"
## ASV_2029 "Globicatella"
## ASV_2051 "Dolosigranulum"
## ASV_2055 "Aerococcus"
## ASV_2059 "Aerococcus"
## ASV_2124 "Staphylococcus"
## ASV_2141 "Staphylococcus"
## ASV_2152 "Staphylococcus"
## ASV_2192 "Streptococcus"
## ASV_2370 NA
## ASV_2405 "Methylobacterium"
## ASV_2406 "Rhodopseudomonas"
## ASV_2407 "Bradyrhizobium"
## ASV_2408 "Bradyrhizobium"
## ASV_2409 "Afipia"
## ASV_2410 "Haematobacter"
## ASV_2411 "Paracoccus"
## ASV_2413 "Enhydrobacter"
## ASV_2471 "Escherichia/Shigella"
## ASV_2482 "Escherichia/Shigella"
## ASV_2612 "Neisseria"
## ASV_2613 "Neisseria"
## ASV_2615 "Lautropia"
## ASV_2672 "Sutterella"
## ASV_2678 "Aquabacterium"
## ASV_2679 "Curvibacter"
## ASV_2683 "Cupriavidus"
## ASV_2684 "Ralstonia"
## ASV_2915 "Actinomyces"
## ASV_2963 "Actinomyces"
## ASV_2966 "Actinomyces"
## ASV_3013 "Pseudoclavibacter"
## ASV_3019 "Rothia"
## ASV_3036 "Gordonia"
## ASV_3045 "Corynebacterium_1"
## ASV_3050 "Corynebacterium_1"
## ASV_3098 "Corynebacterium_1"
## ASV_3108 "Corynebacterium_1"
## ASV_3109 "Corynebacterium_1"
## ASV_3110 "Corynebacterium_1"
## ASV_3116 "Corynebacterium_1"
## ASV_3159 "Corynebacterium_1"
## ASV_3187 "Corynebacterium"
## ASV_3190 "Corynebacterium_1"
## ASV_3306 "Rikenellaceae_RC9_gut_group"
## ASV_3721 "Prevotella_6"
## ASV_3880 "Prevotella_9"
## ASV_4030 "Bacteroides"
## ASV_4200 "Bacteroides"
## ASV_4239 "Bacteroides"
## ASV_4437 "Parabacteroides"
## ASV_4738 NA
## ASV_4750 NA
## ASV_4753 NA
## ASV_4803 "Erysipelotrichaceae_UCG-003"
## ASV_5045 "Peptoniphilus"
## ASV_5308 "Anaerococcus"
## ASV_5433 "Helcococcus"
## ASV_5434 "Helcococcus"
## ASV_5707 "Fastidiosipila"
## ASV_5730 "Ruminococcaceae_UCG-003"
## ASV_5795 "Oscillospira"
## ASV_5848 "Ruminiclostridium"
## ASV_5914 "Intestinimonas"
## ASV_5947 NA
## ASV_6117 NA
## ASV_6162 "Butyricicoccus"
## ASV_6810 "Ethanoligenens"
## ASV_6835 NA
## ASV_6932 "Ruminococcus_1"
## ASV_7050 "Ruminococcaceae_UCG-014"
## ASV_7106 NA
## ASV_7126 NA
## ASV_7219 "Lachnospiraceae_NK4A136_group"
## ASV_7231 "Lachnospiraceae_NK4A136_group"
## ASV_7292 "Lachnospiraceae_UCG-004"
# Many of them are spurious hits and are difficult to judge
# Filter for prevelance more than 10% samples and relative abundance at least 1% in one sample
prevalance <- colSums(pst@otu_table[,putative.contaminants]!=0)/nsamples(ps)>0.1
relative_abun <- apply(pst@otu_table[,putative.contaminants],2,max)>0.01
putative.contaminants <- putative.contaminants[prevalance & relative_abun]
length(putative.contaminants)
## [1] 16
#get a data frame with relative abundance for the putative contaminants
df <- cbind(ps@sam_data, pst@otu_table[,rownames(dco)[dco$contaminant]])
#visually check the relation with amplicon yield and mark T/F TRUE/FALSE
z <- list()
x=0
y=length(putative.contaminants)
# This is an interactive part of the script; run it in the console! *(uncomment using CTRL+SHIFT+c)
# Well call a possible contaminants
for(ASV in putative.contaminants){
x=x+1
tax=NA
rank=6;while(is.na(tax) & rank>2){tax <- as.character(ps@tax_table[ASV,rank]); rank=rank-1}
p <- ggplot(df, aes_string(x="Nucl_Acid_AMP_Conc_RS", y=ASV, color="i7")) +
geom_point() +
scale_y_log10() +
scale_x_log10() +
labs(title=tax) +
geom_smooth(aes(group=Seq_ID),method = "lm")
suppressWarnings(plot(p))
#z[[ASV]] <- readline(paste0("ASV ",x, ":", y, " is contaminant T/F: "))
}
Most of the abundant ASVs were correctly identified as kit contaminants but some, like Corynebacterium might associate with environments that results in less amplicon yields. You be the judge if these are true contaminants or not. Don’t blindly trust the outcome of these bioinformatic tools!